WORKING  PAPER 
ALFRED  P.  SLOAN  SCHOOL  OF  MANAGEMENT 


A  Data  Consumer-based  Approach  to  Supporting 
Data  Quality  Judgement 


December  1992 


WP  #3516-93 
CISL  WP#  92-05 


Y.Jang 

H.  B.  Kon 

Richard  Y.  Wang 

Sloan  School  of  Management,  MIT 


MASSACHUSETTS 

INSTITUTE  OF  TECHNOLOGY 

50  MEMORIAL  DRIVE 

CAMBRIDGE,  MASSACHUSETTS  02139 


Published  in  the  Proceedings  of  the  Second  Annual  Workshop 

on  Information  Technology  and  Systems  (WITS) 

Dallas,  TX  December  1992 




Yeona  Jang  E53-317 

Henry  B.  Kon  E53-322 

Richard  Y.  Wang  E53-322 

Sloan  School  of  Management 

Massachusetts  Institute  of  Technology 

Cambridge, MA 02139




Composite  Information  Systems  Laboratory 

E53-320,  Sloan  School  of  Management 

Massachusetts  Institute  of  Technology 

Cambridge, Mass. 02139

ATTN:    Professor  Richard  Wang 

Tel:  (617)  253^3442 

Fax:(617)734-2137 

e-mail:  rwang@eagle.mit.edu 

© 1992 Yeona Jang, Henry B. Kon, and Richard Y. Wang


A  Knowledge-Based  Approach 
to  Assisting  In  Data  Quality  Judgment 

(Extended  Abstract) 
Yeona Jang
Laboratory for Computer Science
Massachusetts Institute of Technology
yeona@lcs.mit.edu

Henry B. Kon
Sloan School of Management
Massachusetts Institute of Technology
hkon@mit.edu

Richard Y. Wang
Sloan School of Management
Massachusetts Institute of Technology
rwang@mit.edu

Abstract 

As the integration of information systems enables greater accessibility to data from multiple sources, the issue
of data quality becomes increasingly important. This paper attempts to formally address the data quality
judgment problem with a knowledge-based approach. Our analysis has identified several related theoretical
and practical issues. For example, data quality is determined by several factors, referred to as quality
parameters. Quality parameters are often not independent of each other, raising the issue of how to represent
relationships among quality parameters and reason with such relationships to draw insightful knowledge
about the overall quality of data.

In particular, this paper presents a data quality reasoner. The data quality reasoner is a data quality
judgment model based on the notion of a "census of needs." It provides a framework for deriving an overall
data quality value from local relationships among quality parameters. The data quality reasoner will assist
data consumers in judging data quality. This is particularly important when a large amount of the data involved
in decision-making comes from different, unfamiliar sources.

1.  Introduction 

As the integration of information systems has enabled data consumers to gain access to both familiar
and unfamiliar data, there has been growing interest and activity in the area of data quality. Even if
each individual data supplier were to guarantee the integrity and consistency of data, data from
different suppliers may still be of different quality levels, due, for example, to different data
maintenance policies. Unfortunately, as demonstrated in studies presented in the literature such as
[Bonoma, 1985; Burnham, 1985; Johnson, 1990; Laudon, 1986], decisions made based on inaccurate or
out-of-date data can result in serious economic and social damage. The problem of data quality is thus
increasingly critical.

A majority of previous research efforts on data quality have focused on providing data
consumers with "meta-data," i.e., data about data, that can facilitate the judgment of data quality; for
example, data source, creation time, and collection method. We refer to these characteristics of the
data manufacturing process as quality indicators (see Table 1 for examples of quality indicators).
Data quality judgment is still, however, left to the data consumers. Unfortunately, information overload
makes it difficult to analyze such data and draw useful conclusions about data quality. This paper
seeks to assist data consumers in judging whether the quality of data meets their requirements, by
reasoning about information critical to data quality judgment.

Regarding  data  quality,  this  paper  focuses  especially  on  the  problem  of  assessing  levels  of  data 
quality,  i.e.,  the  degree  to  which  data  meets  desired  characteristics  of  the  data  from  the  user's 
perspective.  In  considering  the  data  quality  assessment  problem,  our  analysis  has  identified  several 
theoretical  and  practical  issues: 

1)  What  are  data  quality  requirements? 

2)  How  can  relationships  between  dimensions  of  these  requirements  be  represented? 

3)  What  can  be  known  about  overall  data  quality  from  such  relationships,  and  how? 

The study of major US firms reported in [Wang & Guarrascio, 1991] identified a relatively
exhaustive list of requirements, such as timeliness and credibility. Such requirements are referred to as
quality parameters in this paper (see Table 2 for examples of data quality parameters). Unfortunately,
requirements of data depend largely on the intended usage of the data. For example, consider patient
records. Availability of the records may be more important than accuracy to hospital administration,
while to physicians accuracy is as important as availability of the records for effective patient
management. The issue, then, is how to deal with such user- or application-specificity of
quality-parameter relationships. This paper attempts to address this issue with a knowledge-based approach.
This raises the issue of how to represent relationships among quality parameters. Another important
issue is how to reason with such relationships to draw insightful knowledge about overall data
quality. This paper focuses mainly on addressing the last two issues: representational and reasoning
issues. To do so, we assume that data quality parameters, such as those shown in Table 2, are available for
use.


Table 1: Data Quality Indicators

Indicator            data #1     data #2       data #3
Source               DB#1        DB#2          DB#3
Creation-time        6/11/92     6/9/92        6/2/92
Update-frequency     daily       weekly        monthly
Collection-method    barcode     entry clerk   radio freq.

Table 2: Data Quality Parameters

Parameter            data #1     data #2       data #3
Credibility          High        Medium        Medium
Timeliness           High        Low           Low
Accuracy             High        Medium        Medium

The  mechanism  investigated  in  this  paper  is  the  data  quality  reasoner.  This  is  a  simple  data 
quality  judgment  model  based  on  the  notion  of  a  "census  of  needs."  It  applies  a  knowledge-based 
approach  in  data  quality  judgment.  The  intention  is  to  provide  flexibility  advantages  in  dealing  with 
the  subjective,  decision-analytic  nature  of  data  quality  judgment.  The  data  quality  reasoner  provides  a 
framework  for  representing  and  reasoning  with  local  relationships  among  quality  parameters  to 
produce  an  overall  data  quality  level.  Such  "informating"  ability  of  the  data  quality  reasoner  would 
have  significant  value  for  assisting  data  consumers  in  judging  data  quality,  particularly  when  data 
involved  in  decision-making  come  from  different,  unfamiliar  sources. 

1.1.    Quality  Indicators  and  Quality  Parameters 

It is worth noting the relationship between quality parameters and quality indicators. The essential
distinction between quality indicators and quality parameters is that quality indicators are intended
(primarily) to represent objective information about the data manufacturing process [Wang & Kon,
1992]. Quality parameters, however, can be user- or application-specific, and are derived from either
underlying quality indicators or other quality parameters. The topology of the "quality hierarchy" in
this paper is a single quality parameter derived from n underlying quality parameters. Each
underlying quality parameter, in turn, could be derived from either its underlying quality parameters or
quality indicators. For example, a user may conceptualize quality parameter Credibility as one
depending on underlying quality parameters such as Source-reputation and Timeliness. The quality
parameter Source-reputation, in turn, can be derived from quality indicators such as the number of times
that a source supplies obsolete data. This paper assumes that such derivations are complete, and that
relevant quality parameter values are available.
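
The hierarchy described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the class `QualityNode`, the indicator names `obsolete_count` and `age_days`, and every threshold and rule are hypothetical and are not taken from the paper, which assumes that the derivations have already been carried out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QualityNode:
    """A quality parameter derived from indicators and/or other parameters."""
    name: str
    derive: Callable[[Dict[str, object]], str]  # hypothetical derivation rule
    children: List["QualityNode"] = field(default_factory=list)

    def evaluate(self, indicators: Dict[str, object]) -> str:
        # Evaluate the underlying parameters first, then apply this node's
        # rule to the raw indicators plus the derived child values.
        inputs = dict(indicators)
        for child in self.children:
            inputs[child.name] = child.evaluate(indicators)
        return self.derive(inputs)


# Hypothetical rules and thresholds, chosen only for illustration.
source_reputation = QualityNode(
    "Source-reputation",
    lambda x: "High" if x["obsolete_count"] == 0 else "Low",
)
timeliness = QualityNode(
    "Timeliness",
    lambda x: "High" if x["age_days"] <= 7 else "Low",
)
credibility = QualityNode(
    "Credibility",
    lambda x: "High"
    if x["Source-reputation"] == "High" and x["Timeliness"] == "High"
    else "Medium",
    children=[source_reputation, timeliness],
)

fresh = credibility.evaluate({"obsolete_count": 0, "age_days": 1})
stale = credibility.evaluate({"obsolete_count": 3, "age_days": 30})
```

Here Credibility sits at the top of a two-level hierarchy, and the two underlying parameters are each derived from a single quality indicator.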


1.2.    Overview

In  general,  several  quality  parameters  may  be  involved  in  determining  overall  data  quality.  This 
raises  the  issue  of  how  to  specify  the  degree  to  which  each  quality  parameter  contributes  to  overall 
data  quality.  One  approach  is  to  specify  the  degree,  in  certain  absolute  terms,  for  each  quality 
parameter.  It  may  not,  however,  be  practical  to  completely  specify  such  values.  Rather,  people  often 
conceptualize local relationships, such as "Timeliness is more important than the credibility of a source
for this data, except when timeliness is low." Thus, if timeliness is high and Source-credibility is
medium, the data may be of high quality. The model presented in this paper provides a formal
specification  of  such  local  "dominance  relationships"  between  quality  parameters. 

The  issue  is,  then,  how  to  use  these  local  dominance  relationships  between  quality  parameters, 
and  what  can  be  known  about  data  quality  from  them.  Observe  that  each  local  relationship  between 
quality  parameters  specifies  the  local  relative  significance  of  quality  parameters.  One  way  to  use  local 
dominance  relationships  would  be  to  rank  and  enumerate  quality  parameters  in  the  order  of 
significance  implied  by  local  dominance  relationships.  Finding  a  total  ordering  of  quality  parameters 
consistent with local relative significance, however, can be computationally intensive. In addition, a
complete enumeration of quality parameters may contain too much information to convey to data
consumers  any  insights  about  overall  data  quality.  This  paper  provides  a  model  to  help  data  consumers 
raise  their  levels  of  knowledge  about  the  data  they  use,  and  thus  make  informed  decisions.  Such  a 
process  represents  data  quality  filtering. 

Our  project  involves  an  investigation  of  a  data  quality  judgment  model,  with  the  aim  of  raising 
related  issues  and  describing  mechanisms  behind  the  use  of  knowledge  about  local  quality-parameter 
relationships  in  data  quality  judgment.  Section  2  discusses  a  representation  for  specifying  various  local 
relationships  between  quality  parameters.  Section  3  discusses  the  computational  component  of  the 
quality  judgment  model.  It  includes  a  mechanism  for  reasoning  with  local  dominance  relationships  to 
identify  information  critical  to  overall  data  quality.  Finally,  Section  4  summarizes  this  research  and 
suggests  future  directions  for  the  field  of  data  quality  evaluation. 

1.3.    Related Work

The  decision-analytic  approach,  as  summarized  in  [Keeney  &  Raiffa,  1976],  and  utility  analysis  under 
multiple  objectives,  as  summarized  in  [Chankong  &  Haimes,  1983],  describe  solution  approaches  for 
specifying  preferences  and  resolving  multiple  objectives.  The  preference  structure  of  a  decision  maker  or 
evaluator  is  specified  as  a  hierarchy  of  objectives.  Through  a  decomposition  of  objectives  using  either 
subjectively  defined  mappings  or  formal  utility  analyses,  the  hierarchy  can  be  reduced  to  an  overall 
value.  The  decision-analytic  approach  is  generally  built  around  the  presupposition  of  the  existence  of 
continuous  utility  functions.  The  approach  presented  in  this  paper,  on  the  other  hand,  does  not  require 
that  dominance  relations  between  quality  parameters  be  continuous  functions,  or  that  their  interactions 
be  completely  specified.  It  only  presupposes  that  some  local  dominance  relationships  between  quality 
parameters  exist. 

Representational schemes similar to the one presented in this paper have been investigated, to represent
preferences, in subdisciplines of Artificial Intelligence such as Planning [Wellman, 1990; Wellman &
Doyle, 1991]. That research effort, however, has focused primarily on issues involved in representing
preferences, and much less so on computational mechanisms for reasoning with such knowledge.

2.   Data  Quality  Reasoner 

This section discusses the data quality reasoner, called DQR. DQR is a data quality judgment model
which derives an overall data quality value for a particular data element, based on the following
information:

1)  A set, QP, of underlying quality parameters that affect data quality: $QP = \{q_1, q_2, \ldots, q_n\}$.

2)  A  set,  DR,  of  local  dominance  relationships  between  quality  parameters  in  QP. 

In  particular,  this  paper  addresses  the  following  fundamental  issues  that  arise  in  considering 
the  use  of  local  relationships  between  quality  parameters  in  data  quality  judgment: 

1)  How  to  represent  local  dominance  relationships  between  quality  parameters. 

2)  What  to  do  with  such  local  dominance  relationships. 

Section  2.1  presents  a  representation  scheme  for  specifying  local  dominance  relationships  between 
quality  parameters  in  order  to  facilitate  data  quality  judgment.  Section  2.2  discusses  a  computational 
framework  which  exploits  such  relationships  to  draw  insights  about  overall  data  quality. 

2.1.   Representation  of  Local  Dominance  Relationships 

This subsection discusses a representation of local dominance relationships between quality parameters.
To facilitate further discussion, additional notations are introduced below. For any quality parameter
$q_i$, let symbol $V_i$ denote the set of values that $q_i$ can take on. In addition, the following notation is used
to describe value assignments for quality parameters. For any quality parameter $q_i$, the value
assignment $q_i := v$ (for example, Timeliness := High) represents the instantiation of the value of $q_i$ as $v$,
for some $v$ in $V_i$. Value assignments for quality parameters, such as $q_i := v$, are called
"quality-parameter value assignments". A quality parameter with a particular value assigned to it is also
referred to as an instantiated quality parameter.

For some quality parameters $q_1, q_2, \ldots, q_n$, for some integer $n \geq 1$, $q_1 \cap q_2 \cap \cdots \cap q_n$ represents a
conjunction of quality parameters. Similarly, $q_1 := v_1 \cap q_2 := v_2 \cap \cdots \cap q_n := v_n$, for some $v_i$ in $V_i$, for all
$i = 1, 2, \ldots, n$, represents a conjunction of quality-parameter value assignments. Note that the symbol $\cap$
used in the above statement denotes the logical conjunction, not set intersection, of events asserted by
instantiating quality parameters.

Finally, notation '$\otimes$' is used to state that data quality is affected by quality parameters. That
data quality is affected by quality parameters $q_1, q_2, \ldots$, and $q_n$ is represented as $\otimes(q_1 \cap q_2 \cap \cdots \cap q_n)$.
Statement $\otimes(q_1 \cap q_2 \cap \cdots \cap q_n)$ is called a quality-merge statement, and is read as "the quality merge of $q_1$,
$q_2$, ..., and $q_n$." Simpler notation, $\otimes(q_1, q_2, \ldots, q_n)$, is also used. A quality-merge statement is said to be
instantiated, if all quality parameters in a quality-merge statement are instantiated to certain values.
For example, statement $\otimes(q_1 := v_1 \cap q_2 := v_2 \cap \cdots \cap q_n := v_n)$ is an instantiated quality-merge statement of
$\otimes(q_1, q_2, \ldots, q_n)$, for some $v_i$ in $V_i$, for all $i = 1, 2, \ldots, n$.

The  following  defines  a  local  dominance  relationship  among  quality  parameters. 

Definition 1 (Dominance relation): Let $E_1$ and $E_2$ be two conjunctions of quality-parameter value
assignments. $E_1$ is said to dominate $E_2$, denoted by $E_1 >_d E_2$, if and only if $\otimes(E_1, E_2, +)$ is reducible to
$\otimes(E_1, +)$, where "+" stands for the conjunction of value assignments for the rest of the quality
parameters, in QP, which appear neither in $E_1$ nor in $E_2$.

Note that as implied by "+," this definition assumes the context-insensitivity of reduction: $\otimes(E_1, E_2, +)$
can be reduced to $\otimes(E_1, +)$, regardless of the values of the quality parameters, in QP, that are not
involved in the reduction. Moreover, "+" implies that these uninvolved quality parameters in QP
remain unaffected by the application of reduction. For example, consider a quality-merge statement
which consists of quality parameters Source-credibility, Interpretability, Timeliness, and more.
Suppose that when Source-credibility and Timeliness are High, and Interpretability is Medium,
Interpretability dominates the other two. This dominance relationship can be represented as follows:

"Interpretability := Medium $>_d$ Source-credibility := High $\cap$ Timeliness := High."

Then, $\otimes$(Source-credibility := High, Interpretability := Medium, Timeliness := High, +) is reducible to
quality-merge statement $\otimes$(Interpretability := Medium, +).
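
A single application of a dominance relationship can be sketched in code. This is a minimal illustration, assuming that a conjunction of value assignments is modelled as a frozen set of (parameter, value) pairs; the helper names `conj` and `reduce_once` and the extra parameter Completeness (standing in for "+") are hypothetical.

```python
from typing import FrozenSet, Tuple

Assignment = Tuple[str, str]          # (quality parameter, value)
Conjunction = FrozenSet[Assignment]   # conjunction of value assignments


def conj(*pairs: Assignment) -> Conjunction:
    """Build a conjunction of quality-parameter value assignments."""
    return frozenset(pairs)


def reduce_once(statement: Conjunction,
                dominant: Conjunction,
                dominated: Conjunction) -> Conjunction:
    """Apply E1 >_d E2 to a quality-merge statement.

    If both E1 and E2 occur in the statement, E2 is dropped. All other
    assignments (the "+" part) are left untouched, which mirrors the
    context-insensitivity of reduction.
    """
    if dominant <= statement and dominated <= statement:
        return statement - dominated
    return statement


statement = conj(
    ("Source-credibility", "High"),
    ("Interpretability", "Medium"),
    ("Timeliness", "High"),
    ("Completeness", "Low"),   # hypothetical stand-in for "+"
)
e1 = conj(("Interpretability", "Medium"))
e2 = conj(("Source-credibility", "High"), ("Timeliness", "High"))

reduced = reduce_once(statement, e1, e2)
```

After the call, `reduced` holds only the Interpretability assignment and the uninvolved Completeness assignment.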

As mentioned at the beginning of Section 2, the evaluation of the overall data quality for a
particular data element requires information about a set of quality parameters that play a role in
determining the overall quality, $QP = \{q_1, q_2, \ldots, q_n\}$, and a set DR of local dominance relationships
between quality parameters in QP. Information provided in QP is interpreted by DQR as "the overall
quality is the result of quality merge of quality parameters $q_1, q_2, \ldots$, and $q_n$, i.e., $\otimes(q_1, q_2, \ldots, q_n)$." Local
dominance relationships in DR are used to derive an overall data quality value. It may be unnecessary
or impossible, however, to explicitly state each and every plausible relationship between quality
parameters in DR. Assuming incompleteness of preferences in quality parameter relationships, this
paper approaches the incompleteness issue with the following default assumption: For any two
conjunctions of quality parameters, if no information on dominance relationships between them is
available, then they are assumed to be in the indominance relation. The indominance relation is
represented as follows:

Definition 2 (Indominance relation): Let $E_1$ and $E_2$ be two conjunctions of quality-parameter value
assignments. $E_1$ and $E_2$ are said to be in the indominance relation, if neither $E_1 >_d E_2$ nor $E_2 >_d E_1$.

When  two  conjunctions  of  quality  parameters  are  indominant,  a  data  consumer  may  specify  the  result  of 
quality  merge  of  them,  according  to  his  or  her  needs. 
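
The default assumption can be sketched as a lookup that falls back to indominance whenever DR says nothing about a pair. This is an illustration only: DR is assumed to be stored as a set of (dominant, dominated) pairs of frozen sets, and the function name `relation` is hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Assignment = Tuple[str, str]                          # (quality parameter, value)
Conjunction = FrozenSet[Assignment]
DominanceSet = Set[Tuple[Conjunction, Conjunction]]   # (dominant, dominated)


def relation(e1: Conjunction, e2: Conjunction, dr: DominanceSet) -> str:
    """Classify a pair of conjunctions using DR and the default assumption."""
    forward = (e1, e2) in dr
    backward = (e2, e1) in dr
    if forward and backward:
        # Both directions were stated, so DR is inconsistent for this pair.
        raise ValueError("DR states dominance in both directions")
    if forward:
        return "E1 dominates E2"
    if backward:
        return "E2 dominates E1"
    # No information available: assume the indominance relation.
    return "indominance"


timeliness_high = frozenset({("Timeliness", "High")})
credibility_medium = frozenset({("Source-credibility", "Medium")})
interpretability_medium = frozenset({("Interpretability", "Medium")})

dr = {(timeliness_high, credibility_medium)}

known = relation(timeliness_high, credibility_medium, dr)
unknown = relation(timeliness_high, interpretability_medium, dr)
```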


2.2.  Reasoning Component of DQR

The  previous  subsection  discussed  how  to  represent  local  relationships  between  quality  parameters.  The 
next  question  that  arises  is  then  how  to  derive  overall  data  quality  from  such  local  dominance 
relationships,  i.e.,  how  to  evaluate  a  quality-merge  statement  based  on  such  relationships.  This  task, 
simply  referred  to  as  the  "data-quality-estimating  problem,"  is  summarized  as  follows: 


Data-Quality-Estimating  Problem: 

Let DR be a set of local dominance relationships between quality parameters, $q_1, q_2, \ldots$, and $q_n$.
Compute $\otimes(q_1, q_2, \ldots, q_n)$, subject to local dominance relationships in DR.


An instance of the data-quality-estimating problem is represented as a list of a quality-merge
statement and a corresponding set of local dominance relationships, i.e., $(\otimes(q_1, q_2, \ldots, q_n), DR)$.

The  rest  of  this  section  presents  a  framework  for  solving  the  data-quality-estimating  problem, 
based  on  the  notion  of  "reduction".  The  following  axiom  defines  the  data  quality  value  when  only  one 
quality  parameter  is  involved  in  quality  merge. 

Axiom 1 (Quality Merge): For any quality-merge statement $\otimes(q_1, q_2, \ldots, q_n)$, if $n = 1$, then the value of
$\otimes(q_1, q_2, \ldots, q_n)$ is equal to that of $q_1$.

Quality-merge  statements  with  more  than  one  quality  parameter  are  reduced  to  ones  with  a 
smaller  number  of  quality  parameters.  The  following  define  axioms  which  provide  a  basis  for  the 
reduction.  As  implied  by  Definition  1  and  the  default  assumption,  any  two  conjunctions  of  quality- 
parameter  value-assignments  can  be  in  either  the  dominance  relation  or  in  the  indominance  relation. 
The  following  axiom  specifies  that  any  two  conjunctions  cannot  be  both  in  the  dominance  relation  and  in 
the  indominance  relation. 

Axiom 2 (Mutual Exclusivity): For any two conjunctions $E_1$ and $E_2$ of quality-parameter value
assignments, $E_1$ and $E_2$ are related to each other in exactly one of the following ways:

1. $E_1 >_d E_2$
2. $E_2 >_d E_1$
3. $E_1$ and $E_2$ are in the indominance relation.

The  following  axiom  defines  the  precedence  of  the  dominance  relation  over  the  indominance 
relation.  This  implies  that  while  evaluating  a  quality-merge  statement,  quality  parameters  in  the 
dominance  relation  are  considered  before  those  not  in  the  dominance  relation. 

Axiom 3 (Precedence of $>_d$): The dominance relation takes precedence over the indominance relation.

Reduction-Based  Evaluation:  A  reduction-based  evaluation  scheme  is  any  evaluation  process  where  the 
reduction  operations  take  precedence  over  all  other  evaluation  operations.  Definition  1  and  axiom  3 
allow  the  reduction-based  evaluation  strategy  to  be  used  to  solve  the  data-quality-estimating  problem 
for  quality-merge  statements  with  more  than  one  quality  parameter. 

The use of dominance relationships to reduce a quality-merge statement raises the issue of
which local dominance relationships to apply first, i.e., the order in which local dominance
relationships are applied. Unfortunately, the reduction of a quality-merge statement is not always
well-defined. In particular, a quality-merge statement can be reduced in more than one way, depending
on the order in which the reduction is performed. For example, consider an instance of the
data-quality-estimating problem, $(\otimes(q_1, q_2, q_3, q_4, q_5, q_6), DR)$, where DR consists of the following local
dominance relationships:


Then, the quality-merge statement $\otimes(q_1, q_2, q_3, q_4, q_5, q_6)$ can be reduced to more than one irreducible
quality-merge statement. The following show some of them:

•  In case that $q_1 >_d q_2$, $q_4 >_d q_5$, and $(q_1 \cap q_4) >_d (q_3 \cap q_6)$ are applied in that order,
$\otimes(q_1, q_2, q_3, q_4, q_5, q_6)$ is reducible to $\otimes(q_1, q_4)$, as follows:

$\otimes(q_1, q_2, q_3, q_4, q_5, q_6)$

=  $\otimes(q_1, q_3, q_4, q_5, q_6)$, by applying $q_1 >_d q_2$,

=  $\otimes(q_1, q_3, q_4, q_6)$, by applying $q_4 >_d q_5$,

=  $\otimes(q_1, q_4)$, by applying $(q_1 \cap q_4) >_d (q_3 \cap q_6)$.

•  In case that $q_2 >_d q_3$, $q_5 >_d q_6$, and $(q_2 \cap q_5) >_d (q_1 \cap q_4)$ are applied in that order,
$\otimes(q_1, q_2, q_3, q_4, q_5, q_6)$ is reducible to $\otimes(q_2, q_5)$, as follows:

$\otimes(q_1, q_2, q_3, q_4, q_5, q_6)$

=  $\otimes(q_1, q_2, q_4, q_5, q_6)$, by applying $q_2 >_d q_3$,

=  $\otimes(q_1, q_2, q_4, q_5)$, by applying $q_5 >_d q_6$,

=  $\otimes(q_2, q_5)$, by applying $(q_2 \cap q_5) >_d (q_1 \cap q_4)$.

As  illustrated  in  this  simple  example,  the  reduction  of  a  quality-merge  statement  is  not  always  well- 
defined. 
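
The order dependence illustrated by this example can be reproduced with a short sketch. It rests on assumptions: value assignments are abbreviated to bare parameter names, the helper `apply_in_order` is hypothetical, and the six rules below are one reading of the example (in particular, the rule taken to remove $q_3$ in the second ordering is assumed to be $q_2 >_d q_3$).

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Rule = Tuple[Set[str], Set[str]]   # (dominant conjunction, dominated conjunction)


def apply_in_order(statement: Iterable[str], rules: List[Rule]) -> FrozenSet[str]:
    """Apply dominance rules one after another, in the order given.

    A rule fires only if both of its conjunctions are still fully present
    in the current quality-merge statement.
    """
    current = frozenset(statement)
    for dominant, dominated in rules:
        dominant, dominated = frozenset(dominant), frozenset(dominated)
        if dominant <= current and dominated <= current:
            current = current - dominated
    return current


start = {"q1", "q2", "q3", "q4", "q5", "q6"}

# Two application orders over rules assumed to belong to the same DR.
order_a: List[Rule] = [({"q1"}, {"q2"}), ({"q4"}, {"q5"}), ({"q1", "q4"}, {"q3", "q6"})]
order_b: List[Rule] = [({"q2"}, {"q3"}), ({"q5"}, {"q6"}), ({"q2", "q5"}, {"q1", "q4"})]

result_a = apply_in_order(start, order_a)
result_b = apply_in_order(start, order_b)
```

The two orders end in different statements, $\otimes(q_1, q_4)$ and $\otimes(q_2, q_5)$, and neither can be reduced further by any of the six rules.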

3.    First-Order  Data  Quality  Reasoner 

This  section  investigates  a  simpler  data  quality  reasoner  that  guarantees  the  well-defined  reduction  of 
quality-merge  statements,  by  making  certain  simplifying  assumptions.  To  facilitate  the  next  step  of 
derivation,  an  additional  definition  is  introduced. 

Definition 3 (First-Order Dominance Relation): For any two conjunctions $E_1$ and $E_2$ of quality parameters
such that $E_1$ and $E_2$ are in the dominance relation, $E_1$ and $E_2$ are said to be in the first-order
dominance relation, if each of $E_1$ and $E_2$ consists of one and only one quality parameter.

The first-order data quality reasoner, in short called $DQR_1$, is a data quality judgment model
that satisfies the following:


First-order Data Quality Reasoner ($DQR_1$)

1.  Axioms 1, 2, and 3 hold.

2.  Only indominance and first-order dominance relationships are allowed.

3.  $>_d$ is transitive (i.e., transitive dominance relation).


In the first-order data quality reasoner, higher-order dominance relationships, such as $q_1 \cap q_2 \cap q_3 >_d q_4$
or $q_5 \cap q_6 >_d q_1 \cap q_2$, are not allowed. In addition, the first-order data quality reasoner requires that the
dominance relation be transitive. This implies that for any conjunctions of quality-parameter value
assignments, $E_1$, $E_2$, and $E_3$, if $E_1 >_d E_2$ and $E_2 >_d E_3$, then $E_1 >_d E_3$. Transitivity of the dominance relation
implies the need for an algorithm to verify that, when presented with an instance of the
quality-estimating problem $(\otimes(q_1, q_2, \ldots, q_n), DR)$, dominance relationships in DR do not conflict with each
other. Well-known graph algorithms can be used for performing this check (Cormen, Leiserson, &
Rivest, 1990).
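
One way to carry out such a check is a depth-first search for cycles in the graph whose edges are the first-order dominance relationships. The sketch below is an assumption about how the check could be done, not the paper's own procedure: under transitivity, a cycle would force some assignment to dominate and be dominated by the same assignment, which mutual exclusivity forbids. The function name `has_conflict` is hypothetical, and assignments are abbreviated to strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def has_conflict(dr: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the (dominant, dominated) pairs in DR contain a cycle."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for dominant, dominated in dr:
        graph[dominant].append(dominated)

    WHITE, GRAY, BLACK = 0, 1, 2           # unvisited, on the DFS stack, finished
    color: Dict[str, int] = defaultdict(int)

    def visit(node: str) -> bool:
        color[node] = GRAY
        for successor in graph[node]:
            if color[successor] == GRAY:   # back edge: a dominance cycle
                return True
            if color[successor] == WHITE and visit(successor):
                return True
        color[node] = BLACK
        return False

    # Iterate over a snapshot, since visiting may add empty adjacency lists.
    return any(color[node] == WHITE and visit(node) for node in list(graph))


consistent = [("Interpretability", "Timeliness"), ("Timeliness", "Credibility")]
conflicting = consistent + [("Credibility", "Interpretability")]
```

The recursive search is adequate for the small parameter sets considered here; a very large DR would call for an iterative traversal.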

Quality-merge  statements  can  be  classified  into  groups,  with  respect  to  levels  of  the 
reducibility,  as  defined  below. 

Definition 4 (Irreducible Quality-Merge Statement): For any instantiated quality-merge statement
$e = \otimes(q_1 := v_1, q_2 := v_2, \ldots, q_n := v_n)$ such that $n \geq 2$, for some $v_i$ in $V_i$, $\forall i = 1, 2, \ldots, n$, e is said to be
irreducible, if for any pair of quality-parameter value assignments in e, say $q_i := v_i$ and $q_j := v_j$, $q_i := v_i$
and $q_j := v_j$ are in the indominance relation. Similarly, any quality-merge statement which consists
of one and only one quality parameter is said to be irreducible.


Definition 5 (Completely-Reducible Quality-Merge Statement): For any instantiated quality-merge
statement $e = \otimes(q_1 := v_1, q_2 := v_2, \ldots, q_n := v_n)$ such that $n \geq 2$, for some $v_i$ in $V_i$, $\forall i = 1, 2, \ldots, n$, e is said to
be completely reducible, if for any pair of quality-parameter value assignments in e, say $q_i := v_i$ and
$q_j := v_j$, $q_i := v_i$ and $q_j := v_j$ are in the dominance relation.
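
The two definitions translate into pairwise checks. The sketch below assumes that DR is given as a set of (dominant, dominated) pairs of value assignments that is already closed under transitivity; the function names and the example values are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Assignment = Tuple[str, str]                 # (quality parameter, value)
Dominance = Tuple[Assignment, Assignment]    # (dominant, dominated)


def _related(a: Assignment, b: Assignment, dr: Set[Dominance]) -> bool:
    """True if a and b are in the dominance relation, in either direction."""
    return (a, b) in dr or (b, a) in dr


def is_irreducible(e: Iterable[Assignment], dr: Set[Dominance]) -> bool:
    """Every pair is in the indominance relation (or e has a single parameter)."""
    return all(not _related(a, b, dr) for a, b in combinations(list(e), 2))


def is_completely_reducible(e: Iterable[Assignment], dr: Set[Dominance]) -> bool:
    """e has at least two assignments and every pair is in the dominance relation."""
    items = list(e)
    return len(items) >= 2 and all(_related(a, b, dr) for a, b in combinations(items, 2))


# Hypothetical instantiation of three quality parameters.
interp = ("Interpretability", "High")
timel = ("Timeliness", "Medium")
cred = ("Credibility", "High")

dr = {(interp, cred), (timel, cred), (interp, timel)}

all_three = [interp, timel, cred]
```

With this DR, the three-parameter statement is completely reducible, while a statement holding only `interp` is irreducible.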

The next two subsections discuss algorithms for evaluating a quality-merge statement (QMS) in $DQR_1$.
This process is diagrammed in Figure 1. Algorithm Q-Merge is the top-level algorithm, which receives
as input a quality-merge statement and the corresponding quality parameter relationship set. Within
Algorithm Q-Merge, there is a two-stage process. First, the given quality-merge statement is
instantiated accordingly. It then calls Algorithm Q-Reduction to reduce the QMS into its corresponding
irreducible form.


[Figure 1: flowchart. Inputs: a quality-merge statement, e.g., $\otimes$(Interpretability, Timeliness,
Credibility), and a quality parameter relationship set, e.g., {Interpretability $>_d$ Credibility,
Timeliness $>_d$ Credibility, Interpretability $>_d$ Timeliness}. Algorithm Q-Merge (1. Instantiate QMS;
2. Call Algorithm Q-Reduction; 3. Indominance Processing) passes the QMS to Algorithm Q-Reduction,
receives the irreducible QMS in return, and outputs the overall QMS value.]

Figure 1. The Quality-Merge Statement (QMS) Evaluation Process

3.1.   Reduction  of  Quality-Merge  Statement:  Algorithm  Q-Reduction 

This  section  describes  an  algorithm,  called  Q-Reduction,  for  reducing  a  quality-merge  statement  into  an 
irreducible  quality-merge  statement,  according  to  local  dominance  relationships  between  quality 
parameters.  This  subsection  continues  to  assume  that  all  dominance  relationships  are  first-order. 


Algorithm Q-Reduction

Input: (e, DR),

    where $e = \otimes(q_1 := v_1, q_2 := v_2, \ldots, q_n := v_n)$ for some $v_i$ in $V_i$, $\forall i = 1, 2, \ldots, n$, and

    DR is a set of local dominance relationships between quality parameters $q_1, q_2, \ldots$, and $q_n$.

Output: An irreducible quality-merge statement for e.

Let $\Omega$ be the set of the quality-parameter value assignments in e: $\Omega = \{q_1 := v_1, q_2 := v_2, \ldots, q_n := v_n\}$.

1.  LOOP for each local dominance expression, say $q_i := v_i >_d q_j := v_j$, in DR,

2.      IF ($q_i := v_i \in \Omega$) and ($q_j := v_j \in \Omega$) THEN $\Omega \leftarrow \Omega - \{q_j := v_j\}$

;; The irreducible QMS consists of the quality parameters in the modified $\Omega$ which the loop results in.

;; Let $\Omega'$ denote the final modified $\Omega$. Then, e is reducible to e' which consists of the quality parameters in $\Omega'$.

3.  Return $\otimes(\Omega')$


Figure 2: Algorithm Q-Reduction


Algorithm  Q-Reduction  in  Figure  2  takes  as  input  an  instantiated  quality-merge  statement  e 
and  DR,  and  returns  as  output  an  irreducible  quality-merge  statement  of  e.  The  instantiated  quality- 
merge  statement  e  is  reduced  as  follows. 


For expository purposes, suppose that e = ®(q_1:=v_1, q_2:=v_2, ..., q_n:=v_n), for some v_i in V_i, for all
i = 1, 2, ..., and n, and let Ω be a dynamic set of quality-parameter value assignments, which is
initialized to {q_1:=v_1, q_2:=v_2, ..., q_n:=v_n}. For any pair of quality-parameter value assignments
q_i:=v_i and q_j:=v_j in Ω, if q_i:=v_i >d q_j:=v_j is a member of DR, then e is reducible to a
quality-merge statement with the quality parameters in Ω less q_j:=v_j, by Definition 1. This allows
removing q_j:=v_j from Ω, if both q_i:=v_i and q_j:=v_j are elements of Ω. Continue the process of
removing dominated quality parameters until no pair of the quality parameters in Ω is related by the
dominance relation. Let Ω' denote the modified Ω produced at the end of this removal process. The
quality merge of the quality parameters in Ω' is the corresponding irreducible quality-merge statement
of e, and the algorithm returns ®(Ω'). It is proven in (Jang & Wang, 1991) that Algorithm Q-Reduction
shown in Figure 2 always results in a unique output in the first-order data-quality reasoner, in which
all dominance relations must be first-order.
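
The reduction loop of Figure 2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name q_reduction, the representation of an assignment q_i:=v_i as a (parameter, value) tuple, and the sample parameter names and values are all assumptions introduced here. The sketch makes a single pass over DR, as Figure 2 does, and does not check that DR satisfies the first-order conditions.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: an assignment q_i := v_i is a (parameter, value) pair.
Assignment = Tuple[str, str]
# A local dominance expression is a (dominating, dominated) pair of assignments.
Dominance = Tuple[Assignment, Assignment]


def q_reduction(assignments: Iterable[Assignment],
                dominance_rules: Iterable[Dominance]) -> Set[Assignment]:
    """Remove every assignment dominated by another assignment still present."""
    remaining = set(assignments)  # plays the role of the dynamic set in Figure 2
    for dominating, dominated in dominance_rules:
        # A rule applies only while both of its assignments are still present.
        if dominating in remaining and dominated in remaining:
            remaining.discard(dominated)
    return remaining


# Hypothetical instantiated quality-merge statement and dominance relationship.
e = {("source_credibility", "high"),
     ("timeliness", "low"),
     ("interpretability", "high")}
dr = [(("source_credibility", "high"), ("timeliness", "low"))]

reduced = q_reduction(e, dr)
```

Under these assumed inputs, the assignment for timeliness is removed because it is dominated by the assignment for source_credibility, and the other two assignments form the irreducible statement.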

3.2.    Algorithm  Q-Merge 

When presented with an instance of the quality-estimating problem (®(q_1, q_2, ..., q_n), DR) for
some integer n, Algorithm Q-Merge first instantiates the given quality-merge statement.
The  instantiated  quality-merge  statement  is  then  reduced  until  the  reduction  process  results  in  another 
instantiated  quality-merge  statement  which  cannot  be  reduced  any  further  (using  Q-Reduction).  This 
raises  the  issue  of  how  to  evaluate  an  irreducible  quality-merge  statement. 

Unfortunately,  the  evaluation  of  an  irreducible  quality-merge  statement  is  not  always  well- 
defined.  When  evaluating  an  irreducible  quality-merge  statement,  the  number  of  orders  in  which  the 
quality  merge  operation  can  be  applied  grows  exponentially  with  the  number  of  quality  parameters  in 
the statement. In particular, certain quality-merge statements may evaluate to more than one value,
depending on the order in which the merge is performed. It is possible that the set of values so
obtained might include every
element  of  V-s.  This  paper  evades  this  problem  by  presenting  quality-parameter  value  assignments  in 
the irreducible quality-merge statement returned by Algorithm Q-Reduction, so that a user may use the
information presented according to his or her needs. Figure 3 summarizes Algorithm Q-Merge.


Algorithm Q-Merge

Input: (e, DR),

where e = ®(q_1, q_2, ..., q_n), for some integer n, and

DR is a set of local dominance relationships between quality parameters q_1, q_2, ..., and q_n.

Output: Overall data quality value produced by evaluating e.

1. Instantiate e.

;; Suppose that q_1, q_2, ..., and q_n are instantiated as v_1, v_2, ..., and v_n, respectively, for some v_i in V_i,
;; for all i = 1, 2, ..., and n.

2. e' <- IF (n = 1) THEN e

         ELSE Q-Reduction(®(q_1:=v_1, q_2:=v_2, ..., q_n:=v_n), DR)

3. Present quality-value assignments in e'.


Figure  3:  Algorithm  Q-Merge 
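
The three steps of Figure 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names q_merge and reduce_assignments, the dictionary observed_values used to instantiate the statement, and the sample parameters are hypothetical. As in the paper, the final step only presents the remaining assignments and does not compute a single overall value.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Assignment = Tuple[str, str]               # assumed encoding of q_i := v_i
Dominance = Tuple[Assignment, Assignment]  # (dominating, dominated)


def reduce_assignments(assignments: Set[Assignment],
                       dominance_rules: Iterable[Dominance]) -> Set[Assignment]:
    """Single pass over DR, dropping dominated assignments (as in Figure 2)."""
    remaining = set(assignments)
    for dominating, dominated in dominance_rules:
        if dominating in remaining and dominated in remaining:
            remaining.discard(dominated)
    return remaining


def q_merge(parameters: Sequence[str],
            observed_values: Dict[str, str],
            dominance_rules: Iterable[Dominance]) -> List[Assignment]:
    # Step 1: instantiate e by pairing each parameter with its observed value.
    instantiated = {(q, observed_values[q]) for q in parameters}
    # Step 2: a one-parameter statement is already irreducible; otherwise reduce.
    if len(parameters) == 1:
        reduced = instantiated
    else:
        reduced = reduce_assignments(instantiated, dominance_rules)
    # Step 3: present the remaining assignments (sorted for a stable display).
    return sorted(reduced)


# Hypothetical inputs.
observed = {"source_credibility": "high", "timeliness": "low"}
dr = [(("source_credibility", "high"), ("timeliness", "low"))]
presented = q_merge(["source_credibility", "timeliness"], observed, dr)
```

Returning a sorted list rather than a set is a presentation choice made here so that the data consumer sees the assignments in a predictable order.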

4.   Discussion 

We  have  presented  a  knowledge-based  framework  for  data  quality  judgment  that:  (1)  allows 
specifying  local  dominance  relationships  between  quality  parameters,  (2)  performs  the  reduction  of 
quality-merge  statements,  and  (3)  derives  a  value  for  overall  data  quality.  A  knowledge-based 
approach  was  applied  to  data  quality  judgment,  to  provide  significant  flexibility  advantages  in 
representing  data-consumer-specific  requirements  on  data,  and  thereby  tailoring  data  quality  judgment 
to  data  consumers'  needs. 

In addition, our analysis has identified issues that must be addressed in order for the quality
judgment model presented in this paper to be of practical use. The rest of this section considers the
limitations of the approach explored in this paper, and suggests future directions for the field of data
quality judgment.

Higher-order  data  quality  reasoner:  The  problem  associated  with  the  reduction  of  quality-merge 
statements  was  discussed  in  Section  2.2.  The  first-order  data  quality  reasoner  evades  the  problem  of 
ill-defined  reduction  by  prohibiting  higher  order  relationships.  Real-world  problems,  however, 
often  involve  more  complex  relationships  than  first-order  relationships  between  quality 
parameters.  In  order  to  deal  with  higher-order  relationships,  both  the  representational  and 
algorithmic  components  of  the  first-order  data  quality  reasoner  would  need  to  be  extended. 

Data Acquisition/A hierarchy of quality indicators and parameters: This research assumed that
values of quality parameters are available so that quality-merge statements can be instantiated
properly. Issues of how to represent such values and how to streamline their delivery to the data
quality reasoner, however, must be addressed. One approach to these issues would be to organize
quality parameters and quality indicators in a hierarchy. Information about how to compute a value
of a quality parameter can then be attached to each data element or type. Such a hierarchy would allow
the  derivation  of  a  quality  parameter  value  from  its  underlying  quality  parameters  and  quality 
indicators.  A  tool  for  automatically  constructing  such  a  hierarchy  and  computing  quality- 
parameter  values  would  enhance  the  utility  of  the  data  quality  reasoner. 

Knowledge  Acquisition:  The  capability  of  using  local  dominance  relationships,  which  are  typically 
user-  or  application-specific,  allows  us  to  build  systems  more  adaptable  to  customers'  needs.  As 
application domains grow more complex, however, it becomes increasingly difficult to state all the
relationships  that  must  be  known.  Such  knowledge  acquisition  bottlenecks  could  be  alleviated 
through  development  of  a  computer  program  for  guiding  the  process  of  acquiring  relationships 
between  quality  parameters. 

User interface for cooperative problem solving: As mentioned in Section 3, the evaluation of an
irreducible quality-merge statement is not always well-defined. Different orders in which quality
parameters in an irreducible quality-merge statement are evaluated may result in different values.
This research dealt with the need to evaluate an irreducible quality-merge statement by simply
presenting information on quality parameters in an irreducible quality-merge statement.
Development of a user interface which allows evaluating irreducible quality-merge statements
cooperatively with a data consumer could lessen the problem.
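
The order dependence noted in the last item above can be illustrated with a small sketch. The paper does not define a binary merge function, so merge_pair below (the floor of the mean of two integer ratings) is purely a hypothetical stand-in, as are the sample ratings. The sketch folds this merge from left to right over every ordering of the same three values and collects the distinct results.

```python
from functools import reduce
from itertools import permutations
from typing import Sequence, Set


def merge_pair(a: int, b: int) -> int:
    """Hypothetical binary quality merge: floor of the mean of two ratings."""
    return (a + b) // 2


def outcomes_by_order(values: Sequence[int]) -> Set[int]:
    """Apply merge_pair left to right over every ordering of the values."""
    return {reduce(merge_pair, order) for order in permutations(values)}


# Three hypothetical quality ratings on an ordinal scale.
ratings = (1, 3, 5)
outcomes = outcomes_by_order(ratings)
```

For these ratings the orderings yield two different results, 2 and 3, even though the inputs are identical. With n parameters there are n! left-to-right orderings to consider, which is one reason a cooperative interface that lets the data consumer choose among candidate evaluations may be preferable to fixing a single order.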

In  the  continuous  cycle  of  measurement,  analysis,  and  improvement  for  data  quality 
management,  it  is  crucial  that  a  methodology  be  developed  for  judging  data  quality.  In  particular, 
while  each  individual  data  supplier  may  maintain  integrity  and  consistency  of  its  own  data,  such 
local  integrity  and  consistency  do  not  necessarily  guarantee  that  data  from  different  suppliers  display 
the  same  level  of  quality.  The  development  of  a  system  that  can  assist  data  consumers  in  judging  if  data 
meets  their  requirements  is  important,  particularly  when  decision-making  involves  data  from 
different,  foreign  sources.  The  model  presented  in  this  paper  provides  a  first  step  toward  such  a  system. 